A Very Very Large Corpus Doesn't Always Yield Reliable Estimates

Authors

James R. Curran

Miles Osborne

Abstract

Banko and Brill (2001) suggested that developing very large training corpora may advance empirical Natural Language Processing more than improving methods that use existing, smaller training corpora. This work tests their claim by exploring whether a very large corpus can eliminate the sparseness problems associated with estimating unigram probabilities. We do this by empirically investigating the convergence behaviour of unigram probability estimates on a one-billion-word corpus. As expected, with one billion words many of our estimates converge to their eventual value. For some words, however, no such convergence occurs. We conclude that relying on large corpora alone is not sufficient: we must also pay attention to the statistical modelling.
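As a rough illustration of what tracking the convergence of a unigram estimate can look like, the sketch below computes the relative-frequency estimate count(w)/N at regular checkpoints over a token stream and reports the point from which the estimate stays close to its final value. It is not the authors' procedure: the function names, the checkpoint interval, and the 10% relative tolerance are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

Series = List[Tuple[int, float]]  # (tokens seen, estimated probability)


def running_unigram_estimates(
    tokens: Iterable[str], targets: Iterable[str], checkpoint: int
) -> Dict[str, Series]:
    """Record count(w) / N for each target word every `checkpoint` tokens."""
    target_set = set(targets)
    counts: Counter = Counter()
    history: Dict[str, Series] = {w: [] for w in target_set}
    n = 0
    for n, tok in enumerate(tokens, start=1):
        if tok in target_set:
            counts[tok] += 1
        if n % checkpoint == 0:
            for w in target_set:
                history[w].append((n, counts[w] / n))
    # Also record the final estimate when the stream length is not
    # an exact multiple of the checkpoint interval.
    if n and n % checkpoint != 0:
        for w in target_set:
            history[w].append((n, counts[w] / n))
    return history


def converged_at(series: Series, rel_tol: float = 0.1) -> Optional[int]:
    """Earliest checkpoint after which every estimate stays within
    rel_tol (hypothetical threshold) of the final estimate."""
    if not series:
        return None
    final = series[-1][1]
    if final == 0.0:
        return None  # word never observed; nothing to converge to
    result = None
    # Walk backwards from the end until an estimate leaves the tolerance band.
    for n, p in reversed(series):
        if abs(p - final) <= rel_tol * final:
            result = n
        else:
            break
    return result


if __name__ == "__main__":
    # Toy stream: "a" is spread evenly, "z" appears only in a late burst.
    stream = ["a", "b"] * 45 + ["z"] * 10
    hist = running_unigram_estimates(stream, ["a", "z"], checkpoint=10)
    for word in ("a", "z"):
        print(word, converged_at(hist[word]))
```

In this toy stream the evenly distributed word settles early, while the word that occurs only in a late burst is within tolerance of its final value only at the last checkpoint. This is the kind of behaviour that distinguishes a converging estimate from a non-converging one.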

Publication date: 2002
